Abstract

We study upper and lower bounds on the sample-complexity of learning near- 
optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). For 
the upper bound we make the assumption that each action leads to at most two 
possible next-states and prove a new bound for a UCRL-style algorithm on the number 
of time-steps when it is not Probably Approximately Correct (PAC). The new lower 
bound strengthens previous work by being both more general (it applies to all policies) 
and tighter. The upper and lower bounds match up to logarithmic factors. 
1 Introduction



The goal of reinforcement learning is to construct algorithms that learn to act optimally,
or nearly so, in unknown environments. In this paper we restrict our attention to finite
state discounted MDPs with unknown transitions. The performance of reinforcement
learning algorithms in this setting can be measured in a number of ways, for instance
by using regret or PAC bounds (Kakade, 2003). We focus on the latter, which is a
measure of the number of time-steps where an algorithm is not near-optimal with high
probability. Many previous algorithms have been shown to be PAC with varying bounds
(Kakade, 2003; Strehl and Littman, 2005; Strehl et al., 2006, 2009; Szita and Szepesvári,
2010; Auer, 2011).

We modify the Upper Confidence Reinforcement Learning (UCRL) algorithm of
Auer et al. (2010); Auer (2011); Strehl and Littman (2008) and, under the assumption
that there are at most two possible next-states for each state/action pair, prove a PAC
bound of

$$\tilde O\left(\frac{|S \times A|}{\epsilon^2(1-\gamma)^3}\log\frac{1}{\delta}\right).$$

This bound is an improvement¹ on the previous best (Auer, 2011) and published best
(Szita and Szepesvári, 2010), which are

$$\tilde O\left(\frac{|S \times A|}{\epsilon^2(1-\gamma)^4}\log\frac{1}{\delta}\right) \quad\text{and}\quad \tilde O\left(\frac{|S \times A|}{\epsilon^2(1-\gamma)^6}\log\frac{1}{\delta}\right)$$

respectively. The additional assumption is unfortunate and is probably unnecessary as
discussed in Section 6.

We also present a matching (up to logarithmic factors) lower bound that is both larger
and more general than the previous best given by Strehl et al. (2009). The class of MDPs
used in the counter-example satisfies the assumption used in the upper bound.



2 Notation 

Unfortunately, we found it impossible to reduce the amount of notation and number of 
constants. While we have endeavoured to define everything before we use it, readers are 
encouraged to consult the tables of notation and constants found in the appendix. 

General. $\mathbb N = \{0, 1, 2, \cdots\}$ is the natural numbers. For the indicator function we write
$[\![x = y]\!] = 1$ if $x = y$ and $0$ if $x \neq y$. We use $\wedge$ and $\vee$ for logical and/or respectively. If $A$
is a set then $|A|$ is its size and $A^*$ is the set of all finite ordered subsets. Unless otherwise
mentioned, log represents the natural logarithm. For random variable $X$ we write $\mathbf E X$
and $\operatorname{Var} X$ for its expectation and variance respectively. We make frequent use of the
progression $z_i = 2^i - 2$ for $i \ge 1$. Define a set $Z(a) := \{z_i : 1 \le i \le \arg\min_i \{z_i \ge a\}\}$.

Markov Decision Process. An MDP is a tuple $M = (S, A, p, r, \gamma)$ where $S$ and $A$
are finite sets of states and actions respectively, $r : S \to [0, 1]$ is the reward function,
$p : S \times A \times S \to [0, 1]$ is the transition function and $\gamma \in (0, 1)$ the discount rate. A
stationary policy $\pi$ is a function $\pi : S \to A$ mapping a state to an action. We write $p^{s'}_{s,a}$ as
the probability of moving from state $s$ to $s'$ when taking action $a$ and $p^{s'}_{s,\pi} := p^{s'}_{s,\pi(s)}$. The
value of policy $\pi$ in $M$ and state $s$ is $V^\pi_M(s) := r(s) + \gamma \sum_{s' \in S} p^{s'}_{s,\pi} V^\pi_M(s')$. We view $V^\pi_M$
either as a function $V^\pi_M : S \to \mathbb R$ or a vector $V^\pi_M \in \mathbb R^{|S|}$ and similarly $p_{s,a} \in [0, 1]^{|S|}$ is a
vector. The optimal policy of $M$ is defined $\pi^*_M := \arg\max_\pi V^\pi_M$. Common MDPs are $M$,
$\hat M$ and $\tilde M$, which represent the true MDP, the estimated MDP using empirical transition
probabilities and a model. We write $V := V_M$, $\hat V := V_{\hat M}$ and $\tilde V := V_{\tilde M}$ for their values
respectively. Similarly, $\hat\pi^* := \pi^*_{\hat M}$ and in general, variables with an MDP as a subscript
will be written with a hat, tilde or nothing as appropriate and the subscript omitted.

¹ In this slightly restricted setting.
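
The recursive definition of the value of a stationary policy can be evaluated by fixed-point iteration. The following is a minimal sketch, not part of the original analysis: the function name `evaluate_policy`, the nested-list layout `p[s][a][s2]` and the two-state example are assumptions made purely for illustration.

```python
def evaluate_policy(p, r, policy, gamma, tol=1e-12):
    """Iterate V(s) = r(s) + gamma * sum_s2 p[s][a][s2] * V(s2) with a = policy[s].

    p[s][a][s2] is a transition probability, r[s] a reward in [0, 1] and
    policy[s] the action of a stationary policy. Since gamma < 1 the update
    is a contraction, so the loop terminates.
    """
    n = len(r)
    v = [0.0] * n
    while True:
        new_v = [
            r[s] + gamma * sum(p[s][policy[s]][s2] * v[s2] for s2 in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v


# Hypothetical two-state, single-action MDP (illustration only).
p = [[[0.9, 0.1]], [[0.0, 1.0]]]
r = [1.0, 0.0]
values = evaluate_policy(p, r, policy=[0, 0], gamma=0.5)
```

In this example state 1 is absorbing with reward 0, so its value is 0 and the value of state 0 solves $V = 1 + 0.45 V$.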



3 Estimation 



In the next section we will introduce the new algorithm, but first we give an intuitive intro- 
duction to the type of parameter estimation required to prove sample-complexity bounds 
for MDPs. The general idea is to use concentration inequalities to show the empiric esti- 
mate of a transition probability approaches the true probability exponentially fast in the 
number of samples gathered. There are a wide variety of concentration inequalities, each 
catering to a slightly different purpose. We improve on previous work by using Bernstein's 
inequality, which takes variance into account (unlike Hoeffding). The following example 
demonstrates the need for Bernstein's inequality when estimating the value functions of 
MDPs. It also gives insight into the workings of the proof in the next two sections. 

Consider the Markov reward process on the right with two states
where rewards are shown inside the states and transition probabilities
on the edges. Note this is not an MDP because there are no
actions. We are only concerned with how well the value can be approximated.
Assume $p > \gamma$, $q$ arbitrarily large (but not 1) and let $\hat p$
be the empiric estimate of $p$ and consider the error between our estimated
value and the true value while in state $s_0$. One can show that

$$|\hat V(s_0) - V(s_0)| \approx \frac{|\hat p - p|}{(1-\gamma)^2}. \tag{1}$$

Therefore if $\hat V - V$ is to be estimated to within $\epsilon$ accuracy, we need $|\hat p - p| < \epsilon(1-\gamma)^2$.
Now suppose we bound $|\hat p - p|$ via a standard Hoeffding bound, then with high probability
$|\hat p - p| \lesssim \sqrt{L/n}$ where $n$ is the number of visits to state $s_0$ and $L = \log(1/\delta)$. Therefore to
obtain an error less than $\epsilon(1-\gamma)^2$ we need $n > \frac{L}{\epsilon^2(1-\gamma)^4}$ visits to state $s_0$, which is already
too many for a bound in terms of $1/(1-\gamma)^3$. If Bernstein's inequality is used instead,
then $|\hat p - p| \lesssim \sqrt{Lp(1-p)/n}$ and so $n > \frac{Lp(1-p)}{\epsilon^2(1-\gamma)^4}$ is required, but Equation (1) depends
on $p > \gamma$, which gives $p(1-p) < 1-\gamma$. Therefore $n > \frac{L}{\epsilon^2(1-\gamma)^3}$ visits are sufficient. If $p < \gamma$ then
Equation (1) can be improved.
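
The gap between the two inequalities can be made concrete with a small calculation. This is a sketch only: it drops all constants exactly as the discussion above does, and the function names and the numerical values of $\epsilon$, $\gamma$, $\delta$ and $p$ are assumptions chosen for illustration.

```python
import math


def visits_hoeffding(eps, gamma, delta):
    """Visits n needed so that sqrt(L / n) <= eps * (1 - gamma)**2."""
    L = math.log(1.0 / delta)
    return L / (eps ** 2 * (1.0 - gamma) ** 4)


def visits_bernstein(eps, gamma, delta, p):
    """Visits n needed so that sqrt(L * p * (1 - p) / n) <= eps * (1 - gamma)**2."""
    L = math.log(1.0 / delta)
    return L * p * (1.0 - p) / (eps ** 2 * (1.0 - gamma) ** 4)


# Hypothetical parameters with p > gamma, as assumed in the example above.
eps, gamma, delta, p = 0.1, 0.9, 0.05, 0.95
n_h = visits_hoeffding(eps, gamma, delta)
n_b = visits_bernstein(eps, gamma, delta, p)
```

Because $p > \gamma$ forces $p(1-p) < 1-\gamma$, the variance-aware count `n_b` stays below $L/(\epsilon^2(1-\gamma)^3)$, while `n_h` scales with $1/(1-\gamma)^4$.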



4 Upper Confidence Reinforcement Learning Algorithm 

UCRL is based on the optimism principle for solving the exploration/exploitation dilemma. 
It is model-based in the sense that at each time-step the algorithm acts according to a 
model (in this case an MDP, $\tilde M$) chosen from a model class. The idea is to choose the
smallest model class guaranteed to contain the true model with high probability and act
according to the most optimistic model within this class. With a good choice of model
class this guarantees a policy that biases its exploration towards unknown states that
may yield good rewards while avoiding states that are known to be bad. The approach
has been successful in obtaining uniform sample complexity (or regret) bounds in various
settings (..., 1985; Agrawal, 1995; Auer et al., 2007; Auer et al., 2010; Auer, 2011).



Unfortunately, to prove our new bound we needed to make an assumption about the 
transition probabilities of the true MDP. We do not believe this assumption is crucial, 
but it substantially eases the analysis by removing some dependencies in the more general 
problem. In Section 6 we present an approach to remove the assumption as well as some
intuition into why this ought to be possible, but non-trivial. 



Assumption 1. The true unknown MDP, $M$, satisfies $p^{s'}_{s,a} = 0$ for all but two $s' \in S$,
denoted $sa^+, sa^- \in S$.²



The pseudo-code of UCRL can be found below, but first we define a knownness index,
$\kappa$. If $n$ is the number of times a state/action pair has been visited then $\kappa(\iota, n)$ is the
knownness of that state/action pair at level $\iota$. The knownness of a state increases with
the number of visits, is bounded by $|S|$ and is always a natural number. The reason
for defining these now is that UCRL will only perform an update when the knownness
index of some states would be changed by an update. Unfortunately, the definition below
is unlikely to be very intuitive. A more thorough explanation of knownness is given in
Section 5.



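Definition 2 (Knownness). Define constants

$$w_{\min} := \frac{\epsilon(1-\gamma)}{4|S|}, \qquad \mathcal I := \left\{0, 1, \cdots, \left\lceil \frac{1}{\log 2}\log\frac{4|S|}{\epsilon(1-\gamma)^2} \right\rceil\right\}, \qquad w_\iota := 2^\iota w_{\min}, \qquad \mathcal K := Z(|S|).$$

We define the knownness index, $\kappa : \mathcal I \times \mathbb N \to \mathcal K$ by

$$\kappa(\iota, n) := \max\left\{z \in \mathcal K : z \le \frac{n}{m w_\iota}\right\}$$

where $m \in O\left(\frac{1}{\epsilon^2(1-\gamma)^2}\log\frac{|S \times A|}{\delta}\right)$ is defined in Appendix D.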
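
The knownness index is easier to read as code. The sketch below assumes the reading $\kappa(\iota, n) = \max\{z \in Z(|S|) : z \le n/(m w_\iota)\}$ with $w_\iota = 2^\iota w_{\min}$ and $z_i = 2^i - 2$; the function names are hypothetical, and `m` and `w_min` are passed in as plain numbers rather than computed from the constants of the appendix.

```python
def z(i):
    """The progression z_i = 2**i - 2 for i >= 1, i.e. 0, 2, 6, 14, ..."""
    return 2 ** i - 2


def z_set(a):
    """Z(a): all z_i from i = 1 up to the first index with z_i >= a."""
    values, i = [], 1
    while True:
        values.append(z(i))
        if z(i) >= a:
            return values
        i += 1


def knownness(iota, n, m, w_min, num_states):
    """Largest z in Z(num_states) with z <= n / (m * w_iota)."""
    w_iota = 2 ** iota * w_min
    return max(v for v in z_set(num_states) if v <= n / (m * w_iota))
```

Since $z_{i+1} = 2 z_i + 2$, each increase in knownness requires roughly a doubling of the visit count, which is what keeps the number of updates logarithmic in the number of visits.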



Note that the existence of the function ExtendedValueIteration is proven and an
algorithm given by Strehl and Littman (2008).



² Note that $sa^+$ and $sa^-$ are dependent on $(s, a)$ and are known to the algorithm.
Algorithm 1 UCRL

$t = 1$, $k = 1$, $n(s, a) = n(s, a, s') = 0$ for all $s, a, s'$ and $s_1$ is the start state.
$H := \frac{1}{1-\gamma}\log\frac{8|S|}{\epsilon(1-\gamma)^2}$, $L_1 := \log\frac{2}{\delta_1}$ and $\delta_1 := \frac{\delta}{2|S \times A|^2|\mathcal K \times \mathcal I|}$
loop
    $\hat p^{sa^+}_{s,a} := n(s, a, sa^+)/\max\{1, n(s, a)\}$ and $\hat p^{sa^-}_{s,a} := 1 - \hat p^{sa^+}_{s,a}$
    $\mathcal M_k := \{\tilde M : |\tilde p^{sa^+}_{s,a} - \hat p^{sa^+}_{s,a}| \le \text{ConfidenceInterval}(\tilde p^{sa^+}_{s,a}, n(s, a)),\ \forall (s, a)\}$
    $\tilde M$ = ExtendedValueIteration($\mathcal M_k$)
    $\pi_k = \tilde\pi^*$
    $v(s, a) = v(s, a, s') = 0$ for all $s, a, s'$
    while $\kappa(\iota, n(s, a) + v(s, a)) = \kappa(\iota, n(s, a))$, $\forall (s, a)$, $\iota \in \mathcal I$ do
        Act
    Delay and Update
function Delay
    for $j = 1 \to H$ do
        Act
function Update
    $n(s, a) = n(s, a) + v(s, a)$ and $n(s, a, s') = n(s, a, s') + v(s, a, s')$ $\forall s, a, s'$ and $k = k + 1$
function Act
    $a_t = \pi_k(s_t)$
    $s_{t+1} \sim p_{s_t,a_t}$    ▷ Sample from MDP
    $v(s_t, a_t) = v(s_t, a_t) + 1$ and $v(s_t, a_t, s_{t+1}) = v(s_t, a_t, s_{t+1}) + 1$ and $t = t + 1$
function ExtendedValueIteration($\mathcal M$)
    return optimistic $\tilde M \in \mathcal M$ such that $V^*_{\tilde M}(s) \ge V^*_{M'}(s)$ for all $s \in S$ and $M' \in \mathcal M$.



function ConfidenceInterval($p$, $n$)
    return $\min\left\{\sqrt{\frac{2L_1 p(1-p)}{n}} + \frac{2L_1}{3n},\ \sqrt{\frac{L_1}{2n}}\right\}$
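
The confidence width takes the better of a variance-aware (Bernstein-style) term and a variance-free (Hoeffding-style) term. The following sketch shows how the two interact; the Bernstein term follows the pseudo-code, while the exact form of the Hoeffding term, $\sqrt{L_1/(2n)}$, is an assumption based on the standard inequality, and `L1` is passed in directly rather than derived from $\delta_1$.

```python
import math


def confidence_interval(p, n, L1):
    """Width min{Bernstein-style, Hoeffding-style} for n >= 1 samples.

    p is a transition probability, n the visit count and L1 a log-confidence
    term such as log(2 / delta_1).
    """
    bernstein = math.sqrt(2.0 * L1 * p * (1.0 - p) / n) + 2.0 * L1 / (3.0 * n)
    hoeffding = math.sqrt(L1 / (2.0 * n))  # assumed form of the second term
    return min(bernstein, hoeffding)


# Nearly deterministic transitions give a much narrower interval.
narrow = confidence_interval(0.01, 1000, 5.0)
wide = confidence_interval(0.5, 1000, 5.0)
```

For $p$ near 0 or 1 the Bernstein term dominates the minimum, which is the regime exploited in the estimation example of Section 3; for $p$ near 1/2 the Hoeffding term is the smaller of the two.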



5 Upper PAC Bounds 

We present two new PAC bounds. The first improves on all previous analysis, but relies
on Assumption 1. The second is completely general, but gains an additional dependence
on $|S|$ leading to a PAC bound in terms of $|S|^2$ and $1/(1-\gamma)^3$. This bound is worse than
the previous best in terms of $|S|$, but better in terms of $1/(1-\gamma)$.

Theorem 3. Let $M$ be the true MDP satisfying Assumption 1. Let $\pi$ be the actual (non-
stationary) policy of UCRL (Algorithm 1), then $V^*(s_t) - V^\pi(s_t) > \epsilon$ for at most

$$H U_{\max} + H E_{\max} \in O\left(\frac{|S \times A|}{\epsilon^2(1-\gamma)^3}\log\frac{|S \times A|}{\delta\epsilon(1-\gamma)}\log^2|S|\,\log^3\frac{1}{\epsilon(1-\gamma)}\log^2\log\frac{1}{1-\gamma}\right)$$

time-steps with probability at least $1 - \delta$. ($U_{\max}$ and $E_{\max}$ are defined in Appendix D.)

Note that although $\pi_k$ is stationary, the global policy of UCRL is non-stationary.
Despite this, we will abuse notation by allowing ourselves to write $V^\pi(s_t)$, whereas really
$V^\pi$ should depend on the entire history. Fortunately, when UCRL is not delaying, the
policy $\pi$ is nearly stationary in the sense that it will be so for the next $H$ time-steps. This
allows us to work almost entirely with stationary policies and so discard the cumbersome
notation required for non-stationary policies.
Theorem 4. Let $M$ be the true MDP (possibly not satisfying Assumption 1), then there
exists a policy $\pi$ such that $V^*(s_t) - V^\pi(s_t) > \epsilon$ for at most $|S|\log^3|S|\,(E_{\max}H + U_{\max}H)$
time-steps with probability at least $1 - \delta$.

The proof of Theorem 4 is omitted, but follows easily by converting an arbitrary
MDP with $|S|$ states into a functionally equivalent MDP with $O(|S|^2)$ states that satisfies
Assumption 1. This is done by adding a tree of $2|S|$ states for each state/action pair and
rescaling $\gamma$.

Proof Overview. The proof of Theorem 3 borrows components from the work of
Auer et al. (2010), Strehl and Littman (2008) and Szita and Szepesvári (2010).



1. Bound the number of updates by $U_{\max} := |S \times A||\mathcal K \times \mathcal I|$, which follows from the algorithm
and the definition of knownness. This bounds the number of delaying time-steps to
at most $H U_{\max} \in \tilde O\left(\frac{|S \times A|}{1-\gamma}\right)$ time-steps, which is insignificant from the point of view of
Theorem 3.



2. Show that the true MDP remains in the model class $\mathcal M_k$ for all $k$.

3. Use the optimism principle to show that if $M \in \mathcal M_k$ and $V^* - V^\pi > \epsilon$ then
$|\tilde V^{\pi_k} - V^{\pi_k}| > \epsilon/2$. This key fact shows that if $\pi$ is not nearly-optimal at some time-
step $t$ then the true value and model value of $\pi_k$ differ and so some information is
(probably) gained by following this policy.

4. The final component is to bound the number of time-steps when $\pi$ is not nearly-
optimal.



Episodes and phases. UCRL operates in episodes, which are blocks of time-steps ending
when Update is called. The length of each episode is not fixed; instead, an episode ends
when the knownness of a state changes. We often refer to time-step $t$ and episode $k$
and unless there is ambiguity we will not define $k$ and just assume it is the episode in
which $t$ resides. A delay phase is the period of $H$ contiguous time-steps where UCRL
is in the function Delay, which happens immediately before an update. An exploration
phase is a period of $H$ time-steps starting at $t$ where $t$ is not in a delay phase and where
$\tilde V^{\pi_k}(s_t) - V^{\pi_k}(s_t) > \epsilon/2$. Exploration phases do not overlap. More formally, the starts
of exploration phases, $t_1, t_2, \cdots$, are defined inductively

$$t_1 := \min\left\{t : \tilde V^{\pi_k}(s_t) - V^{\pi_k}(s_t) > \epsilon/2 \wedge t \text{ is not in a delay phase}\right\}$$
$$t_i := \min\left\{t : t \ge t_{i-1} + H \wedge \tilde V^{\pi_k}(s_t) - V^{\pi_k}(s_t) > \epsilon/2 \wedge t \text{ is not in a delay phase}\right\}.$$

Note there need not, and with high probability will not, be infinitely many such $t_i$. The
exploration phases are only used in the analysis, they are not known to UCRL.
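
The inductive definition of the phase starts amounts to a single greedy pass over time. The sketch below is for illustration of the analysis only (UCRL itself never computes these quantities); the function name and the two boolean input sequences are assumptions, and time is indexed from 0 rather than 1.

```python
def exploration_phase_starts(gap_exceeds, in_delay, horizon):
    """Return the start times t_1 < t_2 < ... of the exploration phases.

    gap_exceeds[t] is True when the model value exceeds the true value of the
    current policy by more than eps / 2 at time t; in_delay[t] is True when t
    lies in a delay phase. Each phase lasts `horizon` steps, so a new phase
    may only start once the previous one has finished.
    """
    starts, next_allowed = [], 0
    for t, (gap, delay) in enumerate(zip(gap_exceeds, in_delay)):
        if t >= next_allowed and gap and not delay:
            starts.append(t)
            next_allowed = t + horizon
    return starts


# Hypothetical trace: the gap is always large, steps 3 and 4 are delaying.
starts = exploration_phase_starts(
    [True] * 10, [False] * 3 + [True] * 2 + [False] * 5, horizon=3
)
```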

Weights and variances. We define the weight³ of state/action pair $(s, a)$ as follows.

$$w^\pi(s, a|s') := [\![(s', \pi(s')) = (s, a)]\!] + \gamma \sum_{s''} p^{s''}_{s',\pi(s')} w^\pi(s, a|s''), \qquad w_t(s) := w^{\pi_k}(s, \pi_k(s)|s_t).$$

³ Also called the discounted future state distribution in Kakade (2003).

As usual, $\hat w$ and $\tilde w$ are defined as above but with $p$ replaced by $\hat p$ and $\tilde p$ respectively.
Think of $w_t(s)$ as the expected number of discounted visits to state/action pair $(s, \pi_k(s))$
while following policy $\pi_k$ starting in state $s_t$. The important point is that this value
is approximately equal to the expected number of visits to state/action pair $(s, \pi_k(s))$
within the next $H$ time-steps. We also define the local variance of the value function.
These measure the variability of values while following policy $\pi$.

$$\sigma^\pi(s)^2 := p_{s,\pi} \cdot (V^\pi)^2 - [p_{s,\pi} \cdot V^\pi]^2, \qquad \tilde\sigma^\pi(s)^2 := \tilde p_{s,\pi} \cdot (\tilde V^\pi)^2 - [\tilde p_{s,\pi} \cdot \tilde V^\pi]^2.$$
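
Unrolling the recursion shows that the weight of a state is the discounted sum of the probabilities of being in that state at each future time-step. A minimal sketch under that reading follows; the function name, the matrix layout `p_pi[s][s2]` (transitions under the fixed policy) and the example chain are assumptions for illustration.

```python
def discounted_weights(p_pi, start, gamma, tol=1e-12):
    """w[s] = sum over t of gamma**t * P(s_t = s | s_0 = start).

    p_pi[s][s2] is the transition probability under the fixed policy. The
    state distribution is propagated forward one step at a time and the sum
    is truncated once gamma**t falls below tol.
    """
    n = len(p_pi)
    dist = [1.0 if s == start else 0.0 for s in range(n)]
    w = [0.0] * n
    scale = 1.0
    while scale > tol:
        for s in range(n):
            w[s] += scale * dist[s]
        dist = [sum(dist[s] * p_pi[s][s2] for s in range(n)) for s2 in range(n)]
        scale *= gamma
    return w


# Hypothetical two-state chain in which state 1 is absorbing.
w = discounted_weights([[0.5, 0.5], [0.0, 1.0]], start=0, gamma=0.9)
```

The weights always sum to $1/(1-\gamma)$, which is why only a limited number of states can carry a non-negligible weight at any one time.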

The active set. We will shortly see that states with small $w_t(s)$ cannot influence the
differences in value functions. Thus we define an active set of states where $w_t(s)$ is not
tiny. At each time-step $t$ define the active set $X_t$ by

$$X_t := \left\{s : w_t(s) > \frac{\epsilon(1-\gamma)}{4|S|} =: w_{\min}\right\}.$$

Knownness. We now expand on the concept of knownness and explain its purpose. We
write $n_t(s, a)$ for the value of $n(s, a)$ at time-step $t$ and $n_t(s) := n_t(s, \pi_k(s))$ where $k$ is the
episode associated with time-step $t$. Let $t$ be some non-delaying time-step and suppose $s$
is active ($s \in X_t$). Now let $\iota_t(s) := \arg\min_\iota \{w_t(s) \ge w_\iota\}$ and note that $\iota_t(s) \in \mathcal I$. We define
a partition of the active set $X_t$ by

$$K_t(\kappa, \iota) := \{s \in X_t : \iota_t(s) = \iota \wedge \kappa(\iota_t(s), n_t(s)) = \kappa\}.$$

The set $K_t(\kappa, \iota)$ represents a set of states that have comparable weights and visit counts.
We will show that if $|K_t(\kappa, \iota)| \le \kappa$ for all $\kappa, \iota$ then the values $\tilde V$ and $V$ are reasonably
close. This result forms a key stage in the proof of Theorem 3 because it shows that if $\pi$ is
not nearly-optimal at time-step $t$ then there exists a $K_t(\kappa, \iota)$ that is quite large and where
states have not been visited sufficiently. Furthermore, the weights $w_t(s)$ where $s \in K_t(\kappa, \iota)$
are large enough that some learning is expected to occur.

Analysis. The proof of Theorem 3 follows easily from three key lemmas.

Lemma 5. The following hold:

1. The total number of updates is bounded by $U_{\max} := |S \times A||\mathcal K \times \mathcal I|$.

2. If $M \in \mathcal M_k$ and $t$ is not in a delay phase and $V^*(s_t) - V^\pi(s_t) > \epsilon$ then
$\tilde V^{\pi_k}(s_t) - V^{\pi_k}(s_t) > \epsilon/2$.

Lemma 6. $M \in \mathcal M_k$ for all $k$ with probability at least $1 - \delta/2$.

Lemma 7. The number of exploration phases is bounded by $E_{\max}$ with probability at least
$1 - \delta/2$.

The proofs of the lemmas are delayed while we apply them to prove Theorem 3.

Proof of Theorem 3. By Lemma 6, $M \in \mathcal M_k$ for all $k$ with probability $1 - \delta/2$.
By Lemma 7 we have that the number of exploration phases is bounded by $E_{\max}$ with
probability $1 - \delta/2$. Now if $t$ is not in a delaying or exploration phase and $M \in \mathcal M_k$ then
by Lemma 5, $\pi$ is nearly-optimal. Finally note that the number of updates is bounded by
$U_{\max}$ and so the number of time-steps in delaying phases is at most $H U_{\max}$. Therefore
UCRL is nearly-optimal for all but $H U_{\max} + H E_{\max}$ time-steps with probability $1 - \delta$. ■

We now turn our attention to proving Lemmas 5, 6 and 7. Of these, only Lemma 7
presents a substantial challenge.

Proof of Lemma 5. For part 1 we note that for $\iota \in \mathcal I$ the knownness of a state/action
pair at level $\iota$ satisfies $\kappa \in \mathcal K$. Since the knownness index for each $\iota$ is non-decreasing and
an update only occurs when an index is increased, the total number of updates is bounded
by $U_{\max} := |S \times A||\mathcal K \times \mathcal I|$.



The proof of part 2 is closely related to the approach taken by Strehl and Littman
(2008). Recall that $\tilde M$ is chosen optimistically by extended value iteration. This generates
an MDP, $\tilde M$, such that $\tilde V^*(s) \ge V^*_{M'}(s)$ for all $M' \in \mathcal M_k$. Since we have assumed
$M \in \mathcal M_k$ we have that $\tilde V^{\pi_k}(s) \equiv \tilde V^*(s) \ge V^*(s)$. Therefore $\tilde V^{\pi_k}(s_t) - V^\pi(s_t) > \epsilon$. Finally
note that $t$ is a non-delaying time-step and so policy $\pi$ will remain stationary and equal
to $\pi_k$ for at least $H$ time-steps. Using the definition of the horizon, $H$, we have that
$|V^\pi(s_t) - V^{\pi_k}(s_t)| < \epsilon/2$. Therefore $\tilde V^{\pi_k}(s_t) - V^{\pi_k}(s_t) > \epsilon/2$ as required. ■

Proof of Lemma 6. In the previous lemma we showed that there are at most $U_{\max}$
updates. Therefore we only need to check $M \in \mathcal M_k$ for each $k$ up to $U_{\max}$. Fix an
$(s, a)$ pair and apply the best of either Bernstein or Hoeffding inequalities to show that
$|p^{sa^+}_{s,a} - \hat p^{sa^+}_{s,a}| \le \text{ConfidenceInterval}(p^{sa^+}_{s,a}, n(s, a))$ with probability $1 - \delta_1$. Setting
$\delta_1 := \frac{\delta}{2|S \times A|U_{\max}} \equiv \frac{\delta}{2|S \times A|^2|\mathcal K \times \mathcal I|}$ and applying the union bound completes the proof. ■

We are now ready to work on Lemma 7. The proof follows from two lemmas:

1. If $t$ is the start of an exploration phase then there exists a $(\kappa, \iota)$ such that
$|K_t(\kappa, \iota)| > \kappa$.

2. If $|K_t(\kappa, \iota)| > \kappa$ for sufficiently many $t$ then sufficient information is gained that
some state/action pair must have an increase in knownness.

Lemma 8. Let $t$ be a non-delaying time-step and assume $M \in \mathcal M_k$. If $|K_t(\kappa, \iota)| \le \kappa$ for
all $(\kappa, \iota) \in \mathcal K \times \mathcal I$ then $|\tilde V^{\pi_k}(s_t) - V^{\pi_k}(s_t)| \le \epsilon/2$.

The full proof is long, technical and has been relegated to Appendix B. We provide a
sketch, but first we need some useful results about MDPs and the differences in value
functions.

Lemma 9. Let $M$ and $\tilde M$ be two Markov decision processes differing only in transition
probabilities and $\pi$ be a stationary policy then

$$V^\pi(s_t) - \tilde V^\pi(s_t) = \gamma \sum_s w^\pi_t(s)(p_{s,\pi} - \tilde p_{s,\pi}) \cdot \tilde V^\pi.$$

Proof sketch. Drop the $\pi$ superscript and write $V(s_t) = r(s_t) + \gamma \sum_{s_{t+1}} p^{s_{t+1}}_{s_t,\pi} V(s_{t+1})$.
Then $V(s_t) - \tilde V(s_t) = \gamma[p_{s_t,\pi} - \tilde p_{s_t,\pi}] \cdot \tilde V + \gamma \sum_{s_{t+1}} p^{s_{t+1}}_{s_t,\pi}[V(s_{t+1}) - \tilde V(s_{t+1})]$. The result
is obtained by continuing to expand the second term of the right hand side. ■
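
The identity can be checked numerically on a small example. The sketch below assumes the reading of the lemma in which the weights are taken under the true transitions and the value inside the inner product is the model's value; the two transition matrices, the reward vector and the helper names are hypothetical.

```python
def values(p, r, gamma, iters=2000):
    """Value of the fixed policy whose transition matrix is p."""
    n = len(r)
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s] + gamma * sum(p[s][s2] * v[s2] for s2 in range(n)) for s in range(n)]
    return v


def weights(p, start, gamma, iters=2000):
    """Discounted expected visit counts from `start` under transitions p."""
    n = len(p)
    dist = [1.0 if s == start else 0.0 for s in range(n)]
    w, scale = [0.0] * n, 1.0
    for _ in range(iters):
        w = [w[s] + scale * dist[s] for s in range(n)]
        dist = [sum(dist[s] * p[s][s2] for s in range(n)) for s2 in range(n)]
        scale *= gamma
    return w


gamma = 0.8
r = [1.0, 0.0]
p_true = [[0.9, 0.1], [0.2, 0.8]]   # plays the role of M
p_model = [[0.7, 0.3], [0.4, 0.6]]  # plays the role of the model
v_true, v_model = values(p_true, r, gamma), values(p_model, r, gamma)
w = weights(p_true, 0, gamma)
lhs = v_true[0] - v_model[0]
rhs = gamma * sum(
    w[s] * sum((p_true[s][s2] - p_model[s][s2]) * v_model[s2] for s2 in range(2))
    for s in range(2)
)
```

Both sides agree up to floating-point error, which is the sense in which value differences decompose into weighted local errors in the transition probabilities.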
Lemma 10. If $M \in \mathcal M_k$ at time-step $t$ and $\tilde V := \tilde V^{\pi_k}$ then

$$|(p_{s,\pi_k} - \tilde p_{s,\pi_k}) \cdot \tilde V| \le \sqrt{\frac{8L_1\tilde\sigma^{\pi_k}(s)^2}{n_t(s)}} + \frac{2}{1-\gamma}\left(\frac{L_1}{n_t(s)}\right)^{3/4} + \frac{4L_1}{3n_t(s)(1-\gamma)},$$

where $\tilde\sigma^{\pi_k}(s)^2 := \tilde p_{s,\pi_k} \cdot \tilde V^2 - [\tilde p_{s,\pi_k} \cdot \tilde V]^2$.

The idea is to note that $M, \tilde M$ are in $\mathcal M_k$ and apply the definition of the confidence
intervals. The full proof is subsumed in the proof of the more general Lemma 33 in
the appendix. The following lemma bounds the expected total discounted local variance.

Lemma 11. For any stationary $\pi$ and $\tilde M$, $\sum_{s \in S} \tilde w_t(s)\tilde\sigma^\pi(s)^2 \le \frac{1}{\gamma^2(1-\gamma)^2}$.

See the paper of Sobel (1982) for a proof.
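Proof sketch of Lemma 8. For ease of notation we drop references to $\pi_k$. We approximate
$w(s) \approx \tilde w(s)$ and $|(p_s - \tilde p_s) \cdot \tilde V| \lesssim \sqrt{L_1\tilde\sigma(s)^2/n(s)}$. Using Lemma 9

$$|\tilde V(s_t) - V(s_t)| \lesssim \left|\sum_{s \in X} w_t(s)(p_{s,\pi_k} - \tilde p_{s,\pi_k}) \cdot \tilde V\right| \tag{2}$$
$$\lesssim \sum_{\kappa,\iota}\sum_{s \in K(\kappa,\iota)} \sqrt{\frac{L_1\tilde w_t(s)\tilde\sigma(s)^2}{\kappa m}} \tag{3}$$
$$\le \sqrt{\frac{L_1|\mathcal K \times \mathcal I|}{m}\sum_{s} \tilde w_t(s)\tilde\sigma(s)^2} \le \sqrt{\frac{L_1|\mathcal K \times \mathcal I|}{m\gamma^2(1-\gamma)^2}} \tag{4}$$

where in Equation (2) we used Lemma 9 and the fact that states not in $X$ are visited very
infrequently. In Equation (3) we used the approximations for $(p - \tilde p) \cdot \tilde V$, the definition
of $K(\kappa, \iota)$ and the approximation $w \approx \tilde w$. In Equation (4) we used the Cauchy-Schwarz
inequality,⁴ the fact that $\kappa \ge |K(\kappa, \iota)|$ and Lemma 11. Substituting the value of $m$ (Appendix D)
completes the proof. The extra terms in $m$ are needed to cover the errors in the approxi-
mations made here. ■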

The full proof requires formalising the approximations made at the start of the sketch
above. The second approximation is comparatively easy while showing that $w(s) \approx
\tilde w(s)$ requires substantial work.
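
The Cauchy-Schwarz step of the sketch can be spelled out as follows. This is a worked reading under the same approximations as the sketch, not a replacement for the full proof; the shorthand $a_s := L_1\tilde w_t(s)\tilde\sigma(s)^2/m$ and $\kappa_s$ (the knownness index of the cell containing $s$) are introduced here for illustration only, and cells with $\kappa = 0$ are ignored.

```latex
\sum_{\kappa,\iota}\sum_{s \in K(\kappa,\iota)} \sqrt{\frac{a_s}{\kappa}}
  = \sum_{s} \frac{1}{\sqrt{\kappa_s}}\,\sqrt{a_s}
  \le \sqrt{\sum_{s} \frac{1}{\kappa_s}}\;\sqrt{\sum_{s} a_s}
  = \sqrt{\sum_{\kappa,\iota} \frac{|K(\kappa,\iota)|}{\kappa}}\;\sqrt{\sum_{s} a_s}
  \le \sqrt{|\mathcal K \times \mathcal I|}\;\sqrt{\sum_{s} a_s}.
```

The last inequality uses $|K(\kappa, \iota)| \le \kappa$ for every cell, so each of the $|\mathcal K \times \mathcal I|$ cells contributes at most 1 to the first sum.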

The following lemmas are used to show that $|K_t(\kappa, \iota)|$ cannot be larger than $\kappa$ for
too many time-steps with high probability. Combined with Lemma 8 above this will be
sufficient to bound the number of exploration phases. Let $t$ be the start of an exploration
phase and define $\nu_t(s)$ to be the number of visits to state $s$ within the next $H$ time-steps.
Formally, $\nu_t(s) := \sum_{i=t}^{t+H-1} [\![s_i = s]\!]$.

Lemma 12. Let $t$ be the start of an exploration phase and $w_t(s) \ge w_{\min}$ then $\mathbf E\nu_t(s) \ge
w_t(s)/2$.



⁴ $|\langle \mathbf 1, v\rangle| \le \|\mathbf 1\|_2\|v\|_2$






Proof sketch. Use the definition of the horizon to show that $w_t(s)$ is not much larger
than a bounded-horizon version. Compare $\mathbf E\nu_t(s)$ and the definition of $w_t(s)$. ■



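Lemma 13. Let $N$ be as in Appendix D. If $|K_{t_i}(\kappa, \iota)| > \kappa$ for $4N$ exploration phases
$t_1, t_2, \cdots, t_{4N}$ then $\sum_{i=1}^{4N}\sum_{s \in K_{t_i}(\kappa,\iota)} \nu_{t_i}(s) \ge N\kappa w_\iota$ with probability at least $1 - \delta_1$.

Proof. As in the previous proof we drop $\pi$ superscripts and denote $K_i := K_{t_i}(\kappa, \iota)$. Define
$\nu_i := \sum_{s \in K_i} \nu_{t_i}(s)$.
Now $|K_i| > \kappa$ and so by Lemma 12 we have $\mathbf E\nu_i \ge \kappa w_\iota/2$. We now prepare to use
Bernstein's inequality. Let $X_i = \nu_i - \mathbf E\nu_i$, $\mu := \frac{1}{4N}\sum_{i=1}^{4N} \mathbf E\nu_i$ and $\sigma^2 := \frac{1}{4N}\sum_{i=1}^{4N} \operatorname{Var} X_i$
then

$$P\left\{\sum_{i=1}^{4N}\nu_i \le N\kappa w_\iota\right\} \le P\left\{\sum_{i=1}^{4N}\nu_i \le \frac{1}{2}\sum_{i=1}^{4N}\mathbf E\nu_i\right\} \le P\left\{\left|\sum_{i=1}^{4N}\nu_i - \sum_{i=1}^{4N}\mathbf E\nu_i\right| \ge \frac{1}{2}\sum_{i=1}^{4N}\mathbf E\nu_i\right\} \le 2\exp\left(-\frac{4N\mu^2}{8\sigma^2 + \frac{16\mu}{3(1-\gamma)}}\right).$$

Setting this equal to $\delta_1$ and solving for $4N$ gives

$$4N \ge \left(\frac{8\sigma^2}{\mu^2} + \frac{16}{3\mu(1-\gamma)}\right)\log\frac{2}{\delta_1}.$$

Naively bounding $\sigma^2/\mu^2 \le 1/((1-\gamma)\mu)$ and noting that $\mu \ge w_{\min}/2$ leads to

$$4N \ge \frac{80}{3w_{\min}(1-\gamma)}\log\frac{2}{\delta_1}.$$

Since $4N$ satisfies this, the result is complete. ■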

Proof of Lemma 7. We proceed in two stages. First we bound the total number of
useful visits before $|K(\kappa, \iota)| \le \kappa$. We then show this number of visits occurs after $O(m)$
exploration phases with high probability.

Bounding the number of useful visits. A visit to state/action pair $(s, a)$ in time-
step $t$ is $(\kappa, \iota)$-useful if $\kappa(\iota, n_t(s, a)) = \kappa$. Fixing a $(\kappa, \iota)$ we bound the number of $(\kappa, \iota)$-
useful visits to state/action pair $(s, a)$. Suppose $t_1 < t_2$ and $\kappa(\iota, n_{t_1}(s, a)) = \kappa$ and
$n_{t_2}(s, a) - n_{t_1}(s, a) \ge mw_\iota(2\kappa + 2)$ then $\kappa(\iota, n_{t_3}(s, a)) > \kappa$ for all $t_3 > t_2$. Therefore for
each $(\kappa, \iota)$ pair there are at most $6|S \times A|mw_\iota\kappa$ visits that are $(\kappa, \iota)$-useful.

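Bounding the number of exploration phases. Let $N := 6|S \times A|m$ and $t$ be the
start of an exploration phase. Therefore $\tilde V^{\pi_k}(s_t) - V^{\pi_k}(s_t) > \epsilon/2$ and so by Lemma 8
there exists a $(\kappa, \iota) \in \mathcal K \times \mathcal I$ such that $|S| \ge |K_t(\kappa, \iota)| > \kappa$. If $|K_{t_i}(\kappa, \iota)| > \kappa$ at the start of $4N$
exploration phases, $t_1, t_2, \cdots, t_{4N}$ then by Lemma 13

$$P\left\{\sum_{i=1}^{4N}\sum_{s \in K_{t_i}(\kappa,\iota)} \nu_{t_i}(s) < Nw_\iota\kappa\right\} < \delta_1.$$

Therefore by the union bound there are at most $E_{\max} := 4N|\mathcal K \times \mathcal I|$ exploration phases
with probability $1 - \delta_1|\mathcal K \times \mathcal I| = 1 - \frac{|\mathcal K \times \mathcal I|\delta}{2|S \times A|U_{\max}} > 1 - \delta/2$. ■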
6 Eliminating the Assumption



The upper bound in the previous section could only be proven using Assumption 1. In
this section we describe a possible approach to generalising the proof and why this may
be non-trivial. In the work above we used the assumption to bound $(p_{s,\pi} - \tilde p_{s,\pi}) \cdot \tilde V^\pi \lesssim
\sqrt{L_1\tilde\sigma^\pi(s)^2/n}$. A natural approach to generalising this comes from Bernstein's inequality
(Theorem 30). If $V^\pi \in \mathbb R^{|S|}$ is a value function independent of $\hat p$ then Bernstein's inequality
can be used to show that $(p_{s,\pi} - \hat p_{s,\pi}) \cdot V^\pi \lesssim \sqrt{L_1\sigma^\pi(s)^2/n}$. This suggests we adjust
our model class by letting $\pi := \tilde\pi^*$ and changing the condition to $(\hat p_{s,\pi} - \tilde p_{s,\pi}) \cdot \tilde V^* \lesssim
\sqrt{L_1\tilde\sigma^\pi(s)^2/n}$. We might then bound $(p_{s,\pi} - \tilde p_{s,\pi}) \cdot \tilde V^* = (p_{s,\pi} - \hat p_{s,\pi}) \cdot \tilde V^* + (\hat p_{s,\pi} - \tilde p_{s,\pi}) \cdot \tilde V^*$.
The right term is then bounded by the conditions on the model class and the left term can
perhaps be bounded by noting that $p_{s,\pi}$ is the true probability distribution. Unfortunately,
there are a few problems with this approach:

1. Bounding $(p_{s,\pi} - \hat p_{s,\pi}) \cdot \tilde V^*$ does not result in a bound in terms of $\tilde\sigma^\pi(s)^2$. This issue
can be solved by again applying Bernstein's inequality to bound $(p_{s,a} - \hat p_{s,a}) \cdot (\tilde V^*)^2$.

2. The value $\tilde V^*$ is not in general independent of $\hat p$. This is because $\tilde M$ must be chosen
to satisfy the conditions on $(\hat p_{s,\pi} - \tilde p_{s,\pi}) \cdot \tilde V^*$, which depends on $\hat p$. This dependence
violates the conditions of Bernstein's inequality when trying to bound $(p_{s,\pi} - \hat p_{s,\pi}) \cdot \tilde V^*$.
The dependence is intuitively quite weak, but nevertheless presents problems for
rigorous proof.

3. The last problem is that extended value iteration is no longer a trivial operation (even
granting infinite computation). The problem is that the condition on $(\hat p_{s,a} - \tilde p_{s,a}) \cdot \tilde V^*$
is not local to $(s, a)$, it also depends on the choice of $\tilde p_{s',a'}$ for $(s', a') \in S \times A$. This
complication is probably resolvable, but the formal demonstration of extended value
iteration is no longer so easy.

Progress. The first issue above can be solved, as remarked, by bounding $(p_{s,a} - \hat p_{s,a}) \cdot (\tilde V^*)^2$
using another Bernstein inequality. The problem here is that this condition must now be
added to the definition of the model class. The second issue is non-trivial and we cannot
claim to have made progress there. We did manage to show that extended value iteration
can be extended to the case where the only constraints take the form $(\hat p_{s,\pi} - \tilde p_{s,\pi}) \cdot \tilde V^* \lesssim
\sqrt{L_1\tilde\sigma^\pi(s)^2/n}$. In this case the existence of a globally optimistic MDP can be shown.
Unfortunately if you add constraints on higher moments, $(\hat p_{s,\pi} - \tilde p_{s,\pi}) \cdot (\tilde V^*)^2$, then results
become substantially more complex. Note that in the complete proof of Lemma 8 we used
higher moments still, but this is not required. Lemma 8 can be proven using only bounds
on $(p - \tilde p) \cdot \tilde V^*$ and $(p - \tilde p) \cdot (\tilde V^*)^2$.



7 Lower PAC Bound 



We now turn our attention to proving a matching lower bound. The approach is similar to
that of Strehl et al. (2009), but we make two refinements to improve the bound to depend
on $1/(1-\gamma)^3$ and remove the policy restrictions. The first is to add a delaying state where
no information can be gained, but where an algorithm may still fail to be PAC. The second
is more subtle and will be described in the proof.
Definition 14. A non-stationary policy is a function $\pi : S^* \to A$.

Theorem 15. Let $\pi$ be a (possibly non-stationary) policy depending on $S$, $A$, $r$, $\gamma$, $\epsilon$ and
$\delta$, then there exists a Markov decision process $M_{\text{hard}}$ such that $V^*(s_t) - V^\pi(s_t) > \epsilon$ for at
least $N$ time-steps with probability at least $\delta$ where

$$N := \frac{c_1|S \times A|}{\epsilon^2(1-\gamma)^3}\log\frac{c_2}{\delta}$$

and $c_1, c_2 > 0$ are independent of the policy $\pi$ as well as all inputs $S$, $A$, $\epsilon$, $\delta$, $\gamma$.

The proof can be found in Appendix A but we give the counter-example MDP and
intuition.

Counter Example. We prove Theorem 15 for a class of
MDPs where $S = \{0, 1, \oplus, \ominus\}$ and $A = \{1, 2, \cdots, |A|\}$.
The rewards and transitions for a single action are
depicted in the diagram on the right where $\epsilon(a^*) =
16\epsilon(1-\gamma)$ for some $a^* \in A$ and $\epsilon(a) = 0$ for all other
actions. Some remarks:

1. States $\oplus$ and $\ominus$ are almost completely absorbing
and confer maximum/minimum rewards respectively.

2. The transitions are independent of actions for all
states except state 1. From this state, actions lead
uniformly to $\oplus$/$\ominus$ except for one action, $a^*$, which
has a slightly higher probability of transitioning to
state $\oplus$. Thus $a^*$ is the optimal action in state 1.

3. State 0 has an absorption rate such that, on average,
a policy will stay there for $1/(1-\gamma)$ time-steps.

Figure 1: Hard MDP, with $p := 1/(2-\gamma)$
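
The third remark can be checked with a one-line calculation. The sketch below assumes that $p = 1/(2-\gamma)$ from the figure is the self-loop probability of state 0 (the diagram itself is not reproduced here, so this reading is an assumption); the function name is hypothetical.

```python
def expected_stay(gamma):
    """Expected number of consecutive steps spent in a state whose
    self-loop probability is p = 1 / (2 - gamma)."""
    p = 1.0 / (2.0 - gamma)
    return 1.0 / (1.0 - p)  # mean of a geometric distribution


stay = expected_stay(0.9)
```

Under this reading the expected stay equals $(2-\gamma)/(1-\gamma)$, which always lies between $1/(1-\gamma)$ and $2/(1-\gamma)$, in line with the remark that a policy remains in state 0 for around $1/(1-\gamma)$ time-steps.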



Intuition. The MDP above is very bandit-like in the sense that once a policy reaches
state 1 it should choose the action most likely to lead to state $\oplus$ whereupon it will either
be rewarded or punished (visit state $\oplus$ or $\ominus$). Eventually it will return to state 1 when
the whole process repeats. This suggests a PAC-MDP algorithm can be used to learn the
bandit with $p(a) := p^{\oplus}_{1,a}$. We can then make use of a theorem of Mannor and Tsitsiklis
(2004) on bandit sample-complexity to show that the number of times $a^*$ is not selected
is at least

$$O\left(\frac{|A|}{\epsilon^2(1-\gamma)^2}\log\frac{1}{\delta}\right). \tag{5}$$

Improving the bound to depend on $1/(1-\gamma)^3$ is intuitively easy, but technically somewhat
annoying. The idea is to consider the value differences in state 0 as well as state 1. State
0 has the following properties:

1. The absorption rate is sufficiently large that any policy remains in state 0 for around
$1/(1-\gamma)$ time-steps.

2. The absorption rate is sufficiently small that the difference in values due to bad
actions planned in state 1 still matter while in state 0.

While in state 0 an agent cannot make an error in the sense that $V^*(0) - Q^*(0, a) = 0$
for all $a$. But we are measuring $V^*(0) - V^\pi(0)$ and so an agent can be penalised if its
policy upon reaching state 1 is to make an error. Suppose the agent is in state 0 at some
time-step before moving to state 1 and making a mistake. On average it will stay in state
0 for roughly $1/(1-\gamma)$ time-steps during which time it will plan a mistake upon reaching
state 1. Thus the bound in Equation (5) can be multiplied by $1/(1-\gamma)$. The proof is
harder because an agent need not plan to make a mistake in all future time-steps when
reaching state 1 before eventually doing so in one time-step. Note that Strehl et al. (2009)
proved their theorem for a specific class of policies while Theorem 15 holds for all policies.


8 Conclusion 



Summary. We presented matching upper and lower bounds on the number of time-steps
when a reinforcement learning algorithm fails to be nearly-optimal with high probability.
While the lower bound is completely general, the upper bound depends on the assumption
that there are at most two next-states for each state/action pair. This assumption aside,
the new upper bound improves on the previously best known bound of Auer (2011). If
the assumption is dropped then the new proof can be used to construct an algorithm that
is better than the bound of Auer (2011) in terms of $1/(1-\gamma)$, but worse in $|S|$. The lower
bound, which comes without assumptions, improves on the work of Strehl et al. (2009)
by being both larger and more general. The class of MDPs used for the counter-example
does satisfy Assumption 1 and so the upper and lower bounds now match in this restricted
case.

Running Time. We did not analyze the running time of our version of UCRL, but expect
analysis similar to that of Strehl and Littman (2008) can be used to show that UCRL can
be approximated to run in polynomial time with no cost to sample-complexity.
A Proof of Lower PAC Bound 

The proof makes use of a simple form of bandit and Theorem 16, which lower-bounds 
the sample-complexity of bandit algorithms. We first need some new notation for 
non-stationary policies and bandits. 

History Sequences. We write $s_{1:t} = s_1, s_2, \cdots, s_t$ for the history sequence of length $t$. 
Histories can be concatenated, so $s_{1:t}\oplus = s_1, s_2, \cdots, s_t, \oplus$ where $\oplus \in S$. 

Bandits. An $A$-armed bandit is a vector $p : A \to [0, 1]$. A policy interacts with a 
bandit sequentially. In time-step $t$ some arm $a_t$ is played whereupon the policy receives 
reward 1 with probability $p(a_t)$ and reward 0 otherwise. This is repeated over all 
time-steps. More formally, a bandit policy is a function $\pi : \{0, 1\}^* \to A$. The optimal arm 
is defined as $a^* := \arg\max_a p(a)$. A policy dependent on $\epsilon$, $\delta$ and $A$ has sample-complexity 
$T := T(A, \epsilon, \delta)$ if for all bandits the arm chosen on time-step $T$ satisfies $p(a^*) - p(a_T) \le \epsilon$ 
with probability at least $1 - \delta$. 

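Theorem 16 (Mannor and Tsitsiklis, 2004). There exist positive constants $c_1$, $c_2$, $\epsilon_0$, and 
$\delta_0$, such that for every $A \ge 2$, $\epsilon \in (0, \epsilon_0)$ and $\delta \in (0, \delta_0)$ there exists a bandit $p \in [0, 1]^A$ 
such that 

$$T(A, \epsilon, \delta) \ge \frac{c_1 A}{\epsilon^2} \log \frac{c_2}{\delta}$$

with probability at least $\delta$. 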
Remark 17. The bandit used in the proof of Theorem 16 satisfies $p(a) = \frac{1}{2}$ for all $a$ 
except $a^*$, which has $p(a^*) := \frac{1}{2} + \epsilon$. 
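
To make the bandit setting concrete, the following Python sketch builds a bandit whose arms all have mean 1/2 except one better arm, and runs a naive policy on it. The function names, the uniform-sampling policy, the `gap` parameter and the seed are illustrative assumptions and are not taken from the paper; in particular the policy is not claimed to be sample-optimal.

```python
import random

def make_hard_bandit(num_arms, best_arm, gap):
    """Arm means: 1/2 everywhere except 1/2 + gap at best_arm."""
    return [0.5 + gap if a == best_arm else 0.5 for a in range(num_arms)]

def pull(p, arm, rng):
    """Reward 1 with probability p[arm], reward 0 otherwise."""
    return 1 if rng.random() < p[arm] else 0

def uniform_policy(p, pulls_per_arm, rng):
    """Assumed illustrative policy: pull every arm equally often and
    return the arm with the highest empirical mean."""
    means = []
    for arm in range(len(p)):
        total = sum(pull(p, arm, rng) for _ in range(pulls_per_arm))
        means.append(total / pulls_per_arm)
    return max(range(len(p)), key=lambda a: means[a])

def is_eps_correct(p, chosen, eps):
    """Check the accuracy condition p(a*) - p(chosen) <= eps."""
    return max(p) - p[chosen] <= eps

rng = random.Random(0)
p = make_hard_bandit(num_arms=4, best_arm=2, gap=0.2)
chosen = uniform_policy(p, pulls_per_arm=2000, rng=rng)
print(chosen, is_eps_correct(p, chosen, eps=0.1))
```

Note that the check uses an accuracy `eps` smaller than the gap: if the gap were exactly `eps`, every arm would satisfy the non-strict condition.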

We now prepare to prove Theorem 15. For the remainder of this section let $\pi$ be 
an arbitrary policy and $M_{hard}$ be the MDP of Figure 2. As in previous work we write 
$V^\pi := V^\pi_{M_{hard}}$. The idea of the proof is to use Theorem 16 to show that $\pi$ cannot be 
approximately correct in state 1 too often, and then to use this to show that while in state 0 
beforehand it is also not approximately correct. 

Definition 18. Let $s_{1:\infty} \in S^\infty$ be the sequence of states seen by policy $\pi$ and for arbitrary 
history $s_{1:t}$ let 

$$\Delta(s_{1:t}) := V^*(s_{1:t}) - V^\pi(s_{1:t}).$$

Lemma 19. If $\gamma \in (0, 1)$, $p := 1/(2 - \gamma)$ and $q := 2 - 1/\gamma$ then 

$$p^{1/[4(1-\gamma)]} > 3/4 \quad \text{and} \quad \sum_{k=0}^{\infty} p^k (1-p) \gamma^k = \frac{1}{2}.$$

Proof sketch. Both results follow from the geometric series and easy calculus. ■ 
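
The two claims can be checked numerically. The sketch below is only a sanity check for a few values of $\gamma$, not a proof; it assumes $p = 1/(2-\gamma)$ and evaluates $p^{1/[4(1-\gamma)]}$ together with a truncated version of the series $\sum_k p^k(1-p)\gamma^k$. The function name and the truncation length are illustrative choices.

```python
import math

def lemma19(gamma, terms=100_000):
    """Return (p^{1/[4(1-gamma)]}, truncated series) for p = 1/(2-gamma)."""
    p = 1.0 / (2.0 - gamma)
    survival = p ** (1.0 / (4.0 * (1.0 - gamma)))
    # Truncated version of sum_k p^k (1-p) gamma^k; the neglected tail is
    # of order (p*gamma)^terms, which is tiny for the gammas used here.
    series = math.fsum((p * gamma) ** k * (1.0 - p) for k in range(terms))
    return survival, series

for gamma in (0.1, 0.5, 0.9, 0.99):
    survival, series = lemma19(gamma)
    print(f"gamma={gamma}: survival={survival:.4f}, series={series:.6f}")
```

The closed form behind the second claim is $(1-p)/(1-p\gamma)$, which equals $1/2$ once $p = 1/(2-\gamma)$ is substituted.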

The following lemma lower-bounds $\Delta(s_{1:t})$ if a sub-optimal action $a \ne a^*$ is taken in state 
1. 

Lemma 20. Let $s_{1:t}$ be a history such that $s_t = 1$ and $a := \pi(s_{1:t}) \ne a^*$ then 

$$\Delta(s_{1:t}) \ge 8\epsilon.$$

Proof. The result essentially follows from the definition of the value function. 

$$\Delta(s_{1:t}) = V^*(s_{1:t}) - V^\pi(s_{1:t})$$

$$= \frac{\gamma}{2}\left[V^*(s_{1:t}\oplus) - V^\pi(s_{1:t}\oplus) + V^*(s_{1:t}\ominus) - V^\pi(s_{1:t}\ominus)\right] + \gamma\epsilon(a^*) V^*(s_{1:t}\oplus)$$

$$\ge 8\epsilon,$$

where we used the definition of the value function and MDP, $M_{hard}$. ■ 

We now define time-intervals where the policy is in state 0. Recall we chose the absorption 
in this state such that the expected number of time-steps a policy remains there is 
approximately $1/(1-\gamma)$. We define the intervals starting when a policy arrives in state 
0 and ending when it leaves to state 1. 

Definition 21. Define $t_1^0 := 1$ and 

$$t_i^0 := \min\{t : t > t_{i-1}^1 \wedge s_t = 0 \wedge s_{t-1} \ne 0\}, \qquad t_i^1 := \min\{t - 1 : s_t = 1 \wedge t > t_i^0\}.$$

Define the intervals $I_i := [t_i^0, t_i^1] \subseteq \mathbb{N}$. We call interval $I_i$ the $i$th phase. 

Note the following facts: 
1. Since all transition probabilities are non-zero, $t_i^0$ and $t_i^1$ exist for all $i \in \mathbb{N}$ with 
probability 1. 

2. $|I_i|$ is the number of time-steps spent in state 0 before moving to state 1. 

3. The values $|I_i|$ are independent of $\pi$ and each other. 

Definition 22. Suppose $t \in \mathbb{N}$ and $s_t = 0$ and define the weight of action $a$, $w_t(a)$, by 

$$w_t(a) := \sum_{k=0}^{\infty} p^k (1-p) \gamma^k \left[\pi(s_{1:t}0^k1) = a\right].$$

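Lemma 23. $\sum_{a \in A} w_t(a) = \frac{1}{2}$ for all $t$ where $s_t = 0$. 
Proof. We use Lemma 19. 

$$\sum_{a \in A} w_t(a) = \sum_{a \in A} \sum_{k=0}^{\infty} p^k (1-p) \gamma^k \left[\pi(s_{1:t}0^k1) = a\right] = \sum_{k=0}^{\infty} p^k (1-p) \gamma^k = \frac{1}{2}$$

as required. ■ 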
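
The weight of an action is the geometrically discounted mass $p^k(1-p)\gamma^k$, with $p = 1/(2-\gamma)$, summed over the delays $k$ at which the policy would choose that action on reaching state 1. The following sketch computes truncated weights for a toy policy and checks that they sum to $1/2$. The callback `policy_after_k`, which stands for the action taken after the history $s_{1:t}0^k1$, and the example policy are hypothetical and used only for illustration.

```python
from collections import defaultdict

def action_weights(policy_after_k, gamma, terms=20_000):
    """Truncated action weights: for each k add p^k (1-p) gamma^k to the
    action that policy_after_k(k) would play after k further steps in state 0."""
    p = 1.0 / (2.0 - gamma)
    weights = defaultdict(float)
    for k in range(terms):
        weights[policy_after_k(k)] += (p * gamma) ** k * (1.0 - p)
    return dict(weights)

# Assumed example policy: plan arm 0 if state 1 is reached soon, arm 1 otherwise.
w = action_weights(lambda k: 0 if k < 5 else 1, gamma=0.9)
print(w, sum(w.values()))
```

Because the total is fixed at $1/2$, a weight above $1/4$ on one action forces the combined weight of all other actions below $1/4$.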



Definition 24. Define random variables $A_i$ and $X_i$ by 

$$A_i := \left[\, |I_i| \ge 1/[16(1-\gamma)] \wedge \sum_{a \ne a^*} w_{t_i^0}(a) \ge 1/4 \,\right], \qquad X_i := \left[\, |I_i| \ge 1/[4(1-\gamma)] \,\right].$$

Intuitively, $X_i$ is the event that the $i$th phase lasts at least $1/[4(1-\gamma)]$ time-steps. $A_i$ 
is the event that the $i$th phase lasts at least $1/[16(1-\gamma)]$ time-steps and the combined 
weight of sub-optimal actions at the start of a phase is at least 1/4. The following lemma 
shows that at least two thirds of all phases have $X_i = 1$ with high probability. 

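Lemma 25. For all $n \in \mathbb{N}$, $P\left\{\sum_{i=1}^n X_i < \frac{2}{3}n\right\} \le 2e^{-n/72}$. 
Proof. Preparing to use Hoeffding's bound, 

$$P\{X_i = 1\} := P\{|I_i| \ge 1/[4(1-\gamma)]\} = p^{1/[4(1-\gamma)]} > 3/4,$$

where we used the definitions of $X_i$, $I_i$ and Lemma 19. Therefore $\mathbf{E}X_i > 3/4$. 

$$P\left\{\sum_{i=1}^n X_i < \frac{2}{3}n\right\} \le P\left\{\sum_{i=1}^n X_i \le -\frac{1}{12}n + n\mathbf{E}X_i\right\} = P\left\{\sum_{i=1}^n (X_i - \mathbf{E}X_i) \le -\frac{1}{12}n\right\} \le 2e^{-n/72},$$

where we applied basic inequalities followed by Hoeffding's bound. ■ 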
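
The constant 72 comes from applying Hoeffding's bound $2e^{-2\epsilon^2 n}$ with deviation $\epsilon = 1/12$, since $2 \cdot (1/12)^2 = 1/72$. The sketch below checks this arithmetic and simulates phase lengths. It assumes that a phase length counts the self-transitions in state 0, each occurring with probability $p = 1/(2-\gamma)$, and uses a $\gamma$ for which $1/[4(1-\gamma)]$ is an integer; both choices, like the function names and the seed, are illustrative assumptions.

```python
import math
import random

def hoeffding_bound(eps, n):
    """Two-sided Hoeffding bound 2*exp(-2*eps^2*n) for n independent
    [0,1]-valued random variables."""
    return 2.0 * math.exp(-2.0 * eps * eps * n)

def long_phase_fraction(gamma, n_phases, rng):
    """Simulate n_phases stays in state 0 and return the fraction whose
    length is at least 1/[4(1-gamma)]."""
    p = 1.0 / (2.0 - gamma)
    threshold = 1.0 / (4.0 * (1.0 - gamma))
    long_phases = 0
    for _ in range(n_phases):
        length = 0
        while rng.random() < p:  # remain in state 0 for another step
            length += 1
        if length >= threshold:
            long_phases += 1
    return long_phases / n_phases

rng = random.Random(0)
fraction = long_phase_fraction(gamma=0.9375, n_phases=20_000, rng=rng)
print(fraction, hoeffding_bound(1.0 / 12.0, 720))
```

When $1/[4(1-\gamma)]$ is not an integer, rounding the threshold up can push the simulated fraction below $3/4$, so the simulation is only indicative.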

Lemma 26. If $\gamma > \frac{1}{2}$ and $\sum_{a \ne a^*} w_t(a) \ge \frac{1}{4}$ then $\sum_{a \ne a^*} w_{t+k}(a) \ge \frac{1}{8}$ for all $t \in \mathbb{N}$ and 
$k$ satisfying $0 \le k \le 1/[16(1-\gamma)]$. 
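Proof. Working from the definitions. 

$$\sum_{a \ne a^*} w_t(a) = \sum_{a \ne a^*} \sum_{j=0}^{\infty} p^j (1-p) \gamma^j \left[\pi(s_{1:t}0^j1) = a\right]$$

$$= \sum_{j=0}^{k-1} p^j (1-p) \gamma^j \sum_{a \ne a^*} \left[\pi(s_{1:t}0^j1) = a\right] + p^k \gamma^k \sum_{a \ne a^*} w_{t+k}(a)$$

$$\le (1-p) \sum_{j=0}^{k-1} p^j \gamma^j + p^k \gamma^k \sum_{a \ne a^*} w_{t+k}(a).$$

Rearranging, setting $0 \le k \le 1/[16(1-\gamma)]$ and using the geometric series completes the 
proof. ■ 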

So far, none of our results have been especially surprising. Lemma 25 shows that at least 
two thirds of all phases have length exceeding $1/[4(1-\gamma)]$ with high probability. Lemma 
26 shows that if at the start of a phase $\pi$ assigns a high weight to the sub-optimal actions, 
then it does so throughout the entire phase. The following lemma is more fundamental. It 
shows that the number of phases where $\pi$ assigns a high weight to the sub-optimal actions 
is of order $\frac{1}{\epsilon^2(1-\gamma)^2} \log \frac{1}{\delta}$ with high probability. 

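Lemma 27. Let $N := \frac{c_1 A}{\epsilon^2 (1-\gamma)^2} \log \frac{c_2}{\delta}$ with constants as in Theorem 16, then 

$$\left|\left\{ i : \sum_{a \ne a^*} w_{t_i^0}(a) \ge \frac{1}{4} \wedge i < 2N + 1 \right\}\right| \ge N$$

with probability at least $\delta$. 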

The idea is similar to that in (Strehl et al., 2009). Assume a policy exists that doesn't 
satisfy the condition above and then use it to learn the bandit defined by $p(a) := p^{\oplus}_{1,a}$. 

Proof. Let $p(a) := p^{\oplus}_{1,a}$ be a bandit and use $\pi$ to learn bandit $p$ using Algorithm 2 below, 
which returns an action $a_{best}$ defined as 

$$a_{best} := \arg\max_a \sum_{i=1}^{2N} \left[a = a_i\right], \qquad a_i := \arg\max_{a'} w_{t_i^0}(a').$$

By Theorem 16 the strategy in Algorithm 2 must fail with probability at least $\delta$. Therefore 
with probability at least $\delta$, $a_{best} \ne a^*$. However $a_{best}$ is defined as the majority action of 
all the $a_i$ and so for at least $N$ time-steps $a_i \ne a^*$. Suppose $w_{t_i^0}(a^*) > \frac{1}{4}$, then by Lemma 
23 $\sum_{a \ne a^*} w_{t_i^0}(a) < \frac{1}{4}$ and $a_i = \arg\max_a w_{t_i^0}(a) = a^*$. This implies that with probability 
$\delta$, for at least $N$ time-steps $\sum_{a \ne a^*} w_{t_i^0}(a) \ge \frac{1}{4}$ as required. ■ 
Algorithm 2 Learn Bandit 

    $t = 1$, $s_t = 0$, $k = 0$ 
    loop 
        if $s_t = 1$ then 
            $r \sim p(a_t)$                  ▷ sample from bandit 
            if $r = 1$ then 
                $s_{t+1} = \oplus$ 
            else 
                $s_{t+1} = \ominus$ 
            $k = k + 1$ 
            if $k = 2N$ then 
                $a_{best} = \arg\max_a \sum_{i=1}^{2N} \left[a = \arg\max_{a'} w_{t_i^0}(a')\right]$ 
                exit 
        else 
            $s_{t+1} \sim p_{s_t, a_t}$      ▷ sample from MDP 

Proof of Theorem 15. Suppose $A_i = 1$ and $0 \le k \le 1/[16(1-\gamma)]$ then $s_{1:t_i^0+k} = s_{1:t_i^0}0^k$ 
and 

$$\Delta(s_{1:t_i^0+k}) = \sum_{t=0}^{\infty} p^t (1-p) \gamma^t \Delta(s_{1:t_i^0+k}0^t1) \qquad (6)$$

$$\ge \sum_{t=0}^{\infty} p^t (1-p) \gamma^t \sum_{a \ne a^*} \left[\pi(s_{1:t_i^0+k}0^t1) = a\right] 8\epsilon \qquad (7)$$

$$\ge \sum_{a \ne a^*} w_{t_i^0+k}(a)\, 8\epsilon \qquad (8)$$

$$\ge \epsilon, \qquad (9)$$

where Equation (6) follows from the definition of $\Delta$ and the value function, Equation 
(7) by Lemma 20, Equation (8) by the definition of $w_{t_i^0+k}(a)$ and Equation (9) by Lemma 
26. Thus for each $i$ where $A_i = 1$, policy $\pi$ makes at least $1/[16(1-\gamma)]$ $\epsilon$-errors. The proof 
is completed by showing that $A_i = 1$ for at least $N/6$ time-steps with probability at least 
$\delta$, which follows easily from Lemma 27 and Lemma 25. 

Dependence on $|S|$ is added trivially by chaining arbitrarily many such Markov decision 
processes together. ■ 

Remark 28. Dependence on $S \log S$ can possibly be added by a technique similar to that used 
by Strehl et al. (2009), but the details could be messy. 



B Technical Results 

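Theorem 29 (Hoeffding Inequality). Let $X_1, \cdots, X_n$ be independent random 
variables taking values in $[0, 1]$ with probability 1. Then 

$$P\left\{\left|\frac{1}{n}\sum_{i=1}^n (X_i - \mathbf{E}X_i)\right| \ge \epsilon\right\} \le 2e^{-2\epsilon^2 n}.$$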
Theorem 30 (Bernstein's Inequality (Bernstein, 1924)). Let $X_1, \cdots, X_n$ be independent 
real-valued random variables with zero mean and variance $\mathrm{Var}\,X_i = \sigma_i^2$. If $|X_i| < c$ with 
probability one then 

$$P\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i\right| \ge \epsilon\right\} \le 2\exp\left(-\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3}\right),$$

where $\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2$. 

We can use Hoeffding and Bernstein to bound the gaps $|p - \hat p|$ and $|\tilde p - \hat p|$. We now want 
to combine these in a convenient way to bound $|p - \tilde p|$. 

Lemma 31. Let p,p,p € [0,1] satisfy 

\p — p\ < min{CIi,C7 2 } , 

where 

2 



/2p(l -p) 2 2 

C/i := a/ ^ log- + - log 

n o 3n o 



C/ 2 := 



1 i 2 
— log — . 

2n S 5 



Then 



\p — p\ < 



8p(l-p) 



. 1 2\3 4 , 2 
log — + 2 ( — log — + — log - 



n 



n 



3n 



Proof. Using the first confidence interval 



/ 2p(l-p) 1 2 2 2 

P - P < V ^g T + log -r 

V n o 3n o 

Assume without loss of generality that 1 — p > 1 — p (the case where p > p is identical. 
Therefore 



/2p(l-p). 2 /2(p-p)(l-p) 1 2 2 2 

P-P < \/— -log- + A/ ^log- + — l0g- 

n o V n o on o 



/2p(l-p), 2 

< t/-^ — log T + 

n o 



4 \/sk lo e| . 2 2.2 



J? 



l0g 5 + 3^ l0g ^ 



2p(l -p) ,2 i/l, 2\i 2,2 
''log -=+8* -log- + — log-, 



n 



5 



3n 5 



where we used the second confidence interval and algebra. Bounding [p — p| by the first 
confidence interval leads to 



\p — p\ < 



8p(l-p) ,2 /l, 2V 4,2 

^log- + 2 -log- +i r log- 

n o \n o I on o 



as required. 
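
To give a feel for the quantities in this lemma, the sketch below evaluates the two confidence widths and the combined bound numerically. It reads $CI_1$ as the Bernstein-style width $\sqrt{\frac{2p(1-p)}{n}\log\frac{2}{\delta}} + \frac{2}{3n}\log\frac{2}{\delta}$, $CI_2$ as the Hoeffding-style width $\sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$, and the conclusion as $\sqrt{\frac{8p(1-p)}{n}\log\frac{2}{\delta}} + 2\left(\frac{1}{n}\log\frac{2}{\delta}\right)^{3/4} + \frac{4}{3n}\log\frac{2}{\delta}$. Which of the true, empirical or model probabilities enters each expression is treated here as an assumption (a single value `p` is used throughout), and the function names are illustrative.

```python
import math

def confidence_intervals(p, n, delta):
    """Return (CI_1, CI_2): Bernstein-style and Hoeffding-style widths."""
    log_term = math.log(2.0 / delta)
    ci1 = math.sqrt(2.0 * p * (1.0 - p) * log_term / n) + 2.0 * log_term / (3.0 * n)
    ci2 = math.sqrt(log_term / (2.0 * n))
    return ci1, ci2

def combined_bound(p, n, delta):
    """Right-hand side of the combined bound, under the reading above."""
    log_term = math.log(2.0 / delta)
    return (math.sqrt(8.0 * p * (1.0 - p) * log_term / n)
            + 2.0 * (log_term / n) ** 0.75
            + 4.0 * log_term / (3.0 * n))

for p in (0.01, 0.5):
    ci1, ci2 = confidence_intervals(p, n=1000, delta=0.05)
    print(p, ci1, ci2, combined_bound(p, n=1000, delta=0.05))
```

The Bernstein-style width is the smaller of the two when $p(1-p)$ is small, while the Hoeffding-style width wins near $p = 1/2$, which is why taking the minimum of both is useful.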
C Proof of Lemma 8 

We need to define some higher "moments" of the value function. This is somewhat unfor- 
tunate as it complicates the proof, but may be unavoidable. 

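Definition 32. We define the space of bounded value/reward functions $\mathcal{R}$ by 

$$\mathcal{R}(i) := \left\{ v \in \left[0, \left(\frac{1}{1-\gamma}\right)^i\right]^{|S|} \right\}.$$

Let $\pi$ be some stationary policy. For $r_d \in \mathcal{R}(d)$ define values $V_d^\pi$ by the Bellman equations 

$$V_d^\pi(s) = r_d(s) + \gamma \sum_{s'} p^{s'}_{s,\pi} V_d^\pi(s').$$

Additionally, 

$$\sigma_d^\pi(s)^2 := p_{s,\pi} \cdot (V_d^\pi)^2 - \left[p_{s,\pi} \cdot V_d^\pi\right]^2.$$

Note that $V_d^\pi \in \mathcal{R}(d + 1)$ and $(\sigma_d^\pi)^2 \in \mathcal{R}(2d + 2)$. Let $r_0 \in \mathcal{R}(0)$ be the true reward function 
$r_0(s) := r(s)$ and define a recurrence by $r_{2d+2}(s) := \sigma_d^\pi(s)^2$. We define $\hat r_d$, $\tilde r_d$, $\hat V_d^\pi$, $\tilde V_d^\pi$ and 
$\hat\sigma_d^\pi$, $\tilde\sigma_d^\pi$ similarly but where all parameters have hat/tilde. 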
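
The recurrence can be illustrated on a toy example: solve the Bellman equations for a reward vector, take the one-step variance of the resulting values as the next reward, and solve again. The sketch below assumes a fixed policy-induced transition matrix `P`, uses plain fixed-point iteration, and checks the ranges $1/(1-\gamma)$, $1/(1-\gamma)^2$ and $1/(1-\gamma)^3$ for the first level. The two-state matrix, the rewards and the function names are hypothetical.

```python
def value_function(P, r, gamma, sweeps=5000):
    """Solve V = r + gamma * P V by fixed-point iteration (P is row-stochastic)."""
    n = len(r)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V

def local_variance(P, V):
    """One-step variance p_s . V^2 - (p_s . V)^2 of the next-state value."""
    n = len(V)
    out = []
    for s in range(n):
        mean = sum(P[s][t] * V[t] for t in range(n))
        second = sum(P[s][t] * V[t] ** 2 for t in range(n))
        out.append(max(0.0, second - mean * mean))  # clip rounding noise
    return out

gamma = 0.9
P = [[0.5, 0.5], [0.2, 0.8]]  # assumed toy transition matrix under a fixed policy
r0 = [0.0, 1.0]               # rewards in [0, 1]
V0 = value_function(P, r0, gamma)
r2 = local_variance(P, V0)    # next reward in the recurrence
V2 = value_function(P, r2, gamma)
print(V0, r2, V2)
```

Each level of the recurrence multiplies the admissible range by further powers of $1/(1-\gamma)$, which is why the higher "moments" carry the exponents $d+1$ and $2d+2$.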

The following lemma generalises Lemma 10. 

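Lemma 33. Let $\tilde M \in \mathcal{M}_k$ at time-step $t$ then 

$$\left|(p_{s,\pi} - \tilde p_{s,\pi}) \cdot V_d\right| \le \sqrt{\frac{8 L_1 \sigma_d(s)^2}{n_t(s)}} + 2\left(\frac{L_1}{n_t(s)}\right)^{3/4} \frac{1}{(1-\gamma)^{d+1}} + \frac{4 L_1}{3 n_t(s)(1-\gamma)^{d+1}}.$$

Proof. Drop references to $\pi$ and let $p := p^{sa^+}_{s,\pi}$, $\tilde p := \tilde p^{sa^+}_{s,\pi}$ and $n := n_t(s)$. Since $M, \tilde M \in \mathcal{M}_k$ 
we can apply Lemma 31 to obtain 

$$|p - \tilde p| \le \sqrt{\frac{8 L_1 p(1-p)}{n}} + 2\left(\frac{L_1}{n}\right)^{3/4} + \frac{4 L_1}{3n}.$$

Assume without loss of generality that $V_d(sa^+) \ge V_d(sa^-)$. Therefore we have 

$$\left|(p_s - \tilde p_s) \cdot V_d\right| \le \sqrt{\frac{8 L_1 p(1-p)}{n}} \left(V_d(sa^+) - V_d(sa^-)\right) + 2\left(\frac{L_1}{n}\right)^{3/4} \frac{1}{(1-\gamma)^{d+1}} + \frac{4 L_1}{3n(1-\gamma)^{d+1}}, \qquad (10)$$

where we used Assumption 1 and the fact that $V_d \in \mathcal{R}(d+1)$. Moreover, 

$$p(1-p)\left(V_d(sa^+) - V_d(sa^-)\right)^2 = p(1-p)\left(V_d(sa^+)^2 + V_d(sa^-)^2 - 2V_d(sa^+)V_d(sa^-)\right)$$

$$= pV_d(sa^+)^2 + (1-p)V_d(sa^-)^2 - \left(pV_d(sa^+) + (1-p)V_d(sa^-)\right)^2$$

$$= \sigma_d(s)^2.$$

Substituting into Equation (10) completes the proof. ■ 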
Proof of Lemma 8. For ease of notation we drop $\pi$ and $t$ super/subscripts. Let 

$$\Delta_d := \left|\sum_{s \in S} \left[\tilde w(s) - w(s)\right] r_d(s)\right| = \left|\tilde V_d(s_t) - V_d(s_t)\right|.$$



Using Lemma 9 



7 



^2w{s)(ps -p s ) ■ V d 



< 



< 



e 



4(1 - 7 )d 

e 

4(1-7) 



+ 



^2w{s)(p-p) ■ V d 
d +A d + B d + C d , 



where 



A d := ^2 w ( s "> i 



sex 



n(s) 



4Li 



sex 



3n(s)(l - 7 ) d + 



T Q:=J>( a )2 



s€X 



Li 

n(s) 



3/4 



The expressions $B_d$ and $C_d$ are substantially easier to bound than $A_d$. First we give a 
naive bound on $A_d$, which we use later. 



^ < E 



'8u;(s)a2(s)L 



n(s) 



E E 



l Sw{s)a 2 As)L l 



n(s) 



(11) 



£ E 



8Li|if(/c,t) 



itik 



E < E 

s&K(k,l) ft,te/CxX 



8L 



1 E 



w(s)a 2 d {s 



(12) 



< 



< 



" r "-'E E <»(««(') £ J^^^E^m (w) 



sex 



l/C x JILi 



m(l — 7) 



2d+3 ' 



(14) 



where in Equation (jlip we used the definitions of ^ and /C. In Equation (|12p we ap- 
plied Cauchy-Schwartz and the assumption that < k. In Equation f)13|) we used 
Cauchy- Schwartz again and the definition of /C. Finally we apply the trivial bound of 
w(s)a d (s) < 1/(1 — 7) 2d+3 . Unfortunately this bound is not sufficient for our needs. 
The solution is to approximate $w(s)$ by $\tilde w(s)$ and use Lemma 11 to improve the last step 
above. 



A H < 



/8|/C xl\L 



ni 



(15) 



ses 



ses 



in 



< 



8|/c x m I]Ll E * oow + 5> (s) " " (s)) ^ (s) (16) 

(17) 



se5 



8|/CxI|Li 8|/CxX|Li 

7i H A 2( 2+2, 

7l(l — -y) za + z m 



where Equation (15) is as in the naive bound, Equation (16) is substituting $\tilde w(s)$ for $w(s)$ 
and Equation (17) uses the definition of $\Delta$. Therefore 



Aw < 



+ B d + C d + 



<%L\\K x Z| 



4(1 _ 7 )d 

Expanding the recurrence up to (3 leads to 



A < 8 Yl 

deT>-{/3} 



Lx\KxX 



\ d/(d+2) 



in 



4(1 - 7 ) c 



(1 _ 7 )2d+2 



+ 5 d + C d + 



+ \/ J — A 2 d +2 . 

771 



'Li|/C x X| 


1 






(1 _ 7 )2d+2 





2/(<2+2) 



+ 



Li|/C x X|\ 



/9/C/3+2) 



' L x \KxI\ 

m{l _ 7 ) 2/3+ 3 + B P + C ? 



2/(/3+2) 



(18) 



where we used the naive bound to control $A_\beta$. The bounds on $B_d$ and $C_d$ are somewhat 
easier, and follow similar lines to the naive bound on $A_d$. 



Bd = E^ 5 

sex 



4Li 



3n(s)(l -7) d+1 3(l-7) d+1 ^ run 3m(l - 7 ) d+1 



\K(k,l) . 4|/CxX|L! 



w s 



sex 



U \* 1 



< 



( s )y (i- 7 ) d + 1 - (i - 7 )d+i+i/4 



\K xXlLiV 



in 



Letting m := 2 ®^^2+I/l completes the proof. 
D Constants 



The proof of Theorem [3] uses many constants, which can be hard to keep track of. For 
convenience we list them below, including approximate upper/lower bounds as appropri- 
ate. 



Constant 



8\S\ 



log 2 iu& e(l-7) 2 
21^2 

\T>\ ■= \Z{fi)\ 
|/C| := \Z{\S\)\ 

■ — '•max ~\~ 1 

|/C x X| := |/C||X| 



<?1 := 



7 " ° e(l-7) 

._ e(l-7) 
— 4|5| 

2 1 5" X yl | U max 



Li := log 4- 



m : 



20Li|X:xI||X)| 2 



2( 1 _ 7 )2+2/ / 3 

TV := 6|5 x A|m 
L max := 4/V|/C x Z\ 
U max := \Sx A\\JCxl\ 



o/n 



01og- 



\s\ 



e(l-~f) 

01oglo gT jL 
01og|5| 
01og^ 



01og|5|log^ 



e(l- 7 ) 



T ^log^ 77 



-- 1 \s\ 



0- 



ISxAploglSllog^^ 
01og-^ 



<5e(l-7) 



0^1^ log ^ log \S\ log A log 2 log ^ 



-1^4, log l gxA l i™ t CI 1™ _JS 



log|5|log^log 2 lo gT ^ 



0|5x A|log|5|log 
E Table of Notation 



S, A Finite sets of states and actions respectively. 

$\gamma$ The discount factor. Satisfies $\gamma \in (0, 1)$. 

$\epsilon$ The required accuracy. 

$\delta$ The probability that an algorithm makes more mistakes than its 

sample-complexity. 

$\mathbb{N}$ The natural numbers, starting at 0. 

$\log$ The natural logarithm. 

$\wedge, \vee$ Logical and/or respectively. 

$\mathbf{E}X$, $\mathrm{Var}\,X$ The expectation and variance of random variable $X$ respectively. 

Z^ Z^ . — 2 2 . 

$Z(a)$ Defined as the set of all $z_i$ up to and including $a$. Formally 

$Z(a) := \{z_i : i \le \arg\min_j \{z_j \ge a\}\}$. Contains approximately 
$\log a$ elements. 

$\pi$ A policy. 

$p$ The transition function, $p : S \times A \times S \to [0, 1]$. We also write 

$p^{s'}_{s,a} := p(s, a, s')$ for the probability of transitioning to state $s'$ 
from state $s$ when taking action $a$. $p^{s'}_{s,\pi} := p^{s'}_{s,\pi(s)}$. $p_{s,a} \in [0, 1]^{|S|}$ is 
the vector of transition probabilities. 

$\hat p, \tilde p$ Other transition probabilities, as above. 

$r$ The reward function $r : S \to [0, 1]$. 

$M$ The true MDP. $M := (S, A, p, r, \gamma)$. 

$\hat M$ The MDP with empirically estimated transition probabilities. 

$\hat M := (S, A, \hat p, r, \gamma)$. 

$\tilde M$ An MDP in the model class, $\mathcal{M}$. $\tilde M := (S, A, \tilde p, r, \gamma)$. 

$V^\pi_M$ The value function for policy $\pi$ in MDP $M$. Can either be viewed 

as a function $V^\pi_M : S \to \mathbb{R}$ or vector $V^\pi_M \in \mathbb{R}^{|S|}$. 

$\hat V^\pi, \tilde V^\pi$ The values of policy $\pi$ in MDPs $\hat M$ and $\tilde M$ respectively. 

$\pi^* \equiv \pi^*_M$ The optimal policy in MDP $M$. 

$\hat\pi^* \equiv \pi^*_{\hat M}$ The optimal policy in $\hat M$. 

$\tilde\pi^* \equiv \pi^*_{\tilde M}$ The optimal policy in $\tilde M$. 

$\pi_k$ The (stationary) policy used in episode $k$. 

$n_t(s, a)$ The number of visits to state/action pair $(s, a)$ at time-step $t$. 
$n_t(s, a, s')$ The number of visits to state $s'$ from state $s$ when taking action 
$a$ at time-step $t$. 

$n_t(s)$ The number of visits to state/action pair $(s, \pi_t(s))$ at time-step $t$. 

$v_{t_k}(s, a)$ If $t_k$ is the start of an exploration phase then this is the total 

number of visits to state/action pair $(s, a)$ in that exploration phase. 

$s_t, a_t$ The state and action in time-step $t$ respectively. 

$V_d^\pi$ A higher "moment" value function. See Definition 32. 

$\sigma_d^\pi(s)^2$ The variance of $V_d^\pi(s')$ when taking action $\pi(s)$ in state $s$. Defined 

in Definition 32. 

$L_1$ Defined as $\log(2/\delta_1)$. 

$\mathcal{D}$ Defined as $Z(\beta)$. 

$w_t(s)$ The expected discounted number of visits to state/action pair $(s, \pi_k(s))$ while 

following policy $\pi_k$. 

$X_t$ The active set containing states $s$ where $w(s) > w_{min}$. 

$\mathcal{K}$ A set of indices, $\mathcal{K} := Z(|S|)$. 

$\mathcal{I}$ A set of indices, $\mathcal{I} := \{0, 1, 2, \cdots, \iota_{max}\}$. 

$K_t(\kappa, \iota)$ A set of states that have 

$w_t(s) \in [w_\iota, 2w_\iota) \wedge n_t(s) \in m[\kappa w_\iota, (2\kappa + 2)w_\iota)$. 

Note that $\bigcup_{\kappa,\iota} K_t(\kappa, \iota)$ contains all states with $w(s) > w_{min}$. 